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ABSTRACT 



This paper is a slightly modified and reduced version of the proposal of the 
apeNEXT project, which was submitted to DESY and INFN in spring 2000. 
It presents the basic motivations and ideas of a next generation lattice QCD 
(LQCD) computing project, whose goal is the construction and operation of 
several large scale Multi-TFlops LQCD engines, providing an integrated peak 
performance of tens of TFlops, and a sustained (double precision) performance 
on key LQCD kernels of about 50% of peak speed. The software environment 
supporting these machine is organized in such a way that it allows relatively 
easy migration between apeNEXT and more traditional computer systems. 
We describe the physics motivations behind the project and the hardware and 
software architecture of the new LQCD engine. 



3 



1 Introduction 



Several research groups in the Lattice QCD (LQCD) community have devel- 
oped LQCD optimized massively parallel processors Q. These systems have 
provided in the last decade a significant fraction of all compute cycles available 
all over the world for lattice simulations. In this framework, INFN and DESY 
have developed the APEmille parallel processor. APEmille is a LQCD oriented 
massively parallel number-cruncher Q , providing peak performance of several 
hundred Gflops. The first APEmille systems have been commissioned in late 
1999. More machines have become available since then and yet a few more will 
be built in the near future (see later for details). 

We expect APEmille machines to become the work-horse for LQCD com- 
puting in several laboratories in Europe in the next few years. It is however 
clear (and explained in detail in a following section) that APEmille is unable to 
support serious LQCD simulations at the level expected after the year 2003. 

The continuing physics motivation to pursue numerical studies of lattice 
QCD and the level of needed computing resources have been analyzed in detail 
by a review panel appointed by the European Committee for Future Accelerator 
(ECFA) 1^. We fully endorse the conclusions of the ECFA report (which can 
be regarded as an ideal introduction to the present document). In this paper we 
present a proposal for a new lattice QCD project that builds on the experience 
of the previous generation APE machines and tries to implement several of the 
recommendations of the ECFA panel. This paper is an enlarged and improved 
version of a preliminary proposal |^ , submitted to the INFN Board of Directors 
in summer 1999. 

The new project (that we refer to as apeNEXT) is characterized by the 
following architectural goals: 

• an expected peak performance for large machines in excess of 5 TFlops, 
using double precision floating point arithmetics. 

• a sustained (double precision) efficiency of about 50% on key LQCD ker- 
nels (such as the inversion of the Dirac operator). 

• a large on-line data storage (512 GByte to 1 TByte for large machines). 

• input/output channels able to sustain a data-rate of 0.5 MByte/sec/GFlops. 

• a programming environment that allows relatively straightforward and 
easy migration of physics codes between apeNEXT and more traditional 
computer systems. 
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From the point of view of the organization of the project, the fohowing points 
are in order: 

• the apeNEXT architecture will be very closely optimized to LQCD sim- 
ulations. In other words, apeNEXT will be more tuned towards LQCD 
than APEmille. 

• The general know-how of APEmille, as well as several important building 

blocks, will be heavily reused in the new project (properly rescaled to keep 
technology advances into account). This is a key point that we plan to 
leverage on, in order to shorten development time. 

• We plan from the beginning the installation of several large machines at 

approximately the same time at several collaboration sites. (Collaboration 
membership is also somewhat enlarged in comparison with APEmille). 
Stated otherwise, we plan to build up very high processing performance 
for LQCD (of the order of several tens of TFlops) by operating in a loosely 
coordinated way several machines. 

• Provisions to facilitate an industrial exploitation of the project are not one 
of the stated goals of the project. We do see however that several building 
blocks of the project (most notably in the area of inter-node communi- 
cations) may have an important impact on other areas of computing for 
physics (and, more generally, for cluster computing or farming). We will 
do our best to make our results reusable. 

This paper describes the hardware and software architecture that we plan to 
develop. It does not cover the organization of the project, the proposed schedule 
of our activities and any financial issues. These points are considered elsewhere. 
The paper is organized as follows: 

• Section 2 discusses the physics goals of the project and their correspond- 
ing computing requirements (in terms of processing performance, data 

storage, bandwidth). 

• Section 3 briefly summarizes the APEmille architecture and substantiates 
the need for a new project. 

• Section 4 reviews similar planned or started projects. 

• Section 5 surveys the status and prospects of some enabling technologies 
for our project. 

• Section 6 discusses advantages and disadvantages of custom versus off- 
the-shelf technologies for the processing element of the new computer. 
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• Section 7 presents the global architecture of our new massively parallel 
LQCD machine. 

• Section 8 describes the details of the processing node. 

• Section 9 covers the architecture of the interconnection network. 

• Section 10 discusses several possible options for the topology and the 
mechanical set-up of the system. 

• Section 11 is the first section on software. Here we describe the pro- 
gramming environment that we plan to develop for apeNEXT. 

• Section 12 is a matching section discussing the operating system and 
other system-software issues. 

• Section 13 reviews the design method that we plan to follow in the 
development of the system. 

• Section 14 contains our conclusions. 



2 Physics Requirements 

In the definition of the new project we keep a clear focus on a very limited 
number of important physics simulation areas, that set the requirements for the 
new project. 

The translation of physics requirements into machine parameters requires 
certain assumptions about the algorithms to be used. We base our consider- 
ations on tested algorithms such as SSOR-preconditioned BiCGstab and Hy- 
brid Monte Carlo, for Wilson fermions with improved action New theoreti- 
cal developments (domain wall fermions, Wilson-Dirac operators satisfying the 
Ginsparg- Wilson relation, etc.) are likely to be implemented in a way which has 
computational characteristics very similar to the standard Dirac operator. 

We expect that in the years 2003-2006, large production LQCD simulations 
will be mainly focused on the following lines: 



• full QCD simulations (including dynamical fermions) on lattices with sizes 
of the order of 48^ x 96 (a physical system oi L — 2—4 fm and a = O.l-f-O.OS 
fm). Dynamical quark masses should also decrease, with a reasonable 
target corresponding to rriTr/mp ~ 0.35 (although it is not realistic to 
expect that both goals are obtained in the same simulation). 
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• simulations in the quenched approximation on very large lattices (100^ x 
100 200) and large P (L = 1.5 ^ 2.0 fm and a = 0.1 0.02 fm) for the 
study of b physics with as little extrapolation as possible in the mass of 
the heavy quark. 

The first item is heavily CPU limited, since one has to solve the Dirac 
equation repeatedly during the updating process. The second item is basically 
memory limited, due to very large lattice size. In both cases, our target is a 
resolution about two-times better than currently possible (implying, as discussed 
later on, an increase in computing power of two orders of magnitude). 

As a guideline to define a new LQCD engine for these classes of problems, 
we require that: 

1. The node-topology and communication network is optimized for the lattice 
sizes required in full QCD simulations. Since for many problems of LQCD 
it is important to perform a finite-size scaling analysis, it is desirable that 
the machine performs efficiently not only on large but also on compara- 
tively small lattices, e.g., in full QCD one may think of x Nt lattices 
with Nl = 16,20, ...,32 and 48, and Nl < Nt < 2Nl. For smaller lat- 
tices, as the required computing performance decreases, more traditional 
machines (such as PC clusters) or previous generation dedicated systems 
can be used. 

2. The communication network has enough bandwidth to handle the large 
degree of data exchange between neighbouring sites (and hence compute 
nodes) needed in LQCD computations. The interconnect architecture 
should support the natural (APE-like) programming model with direct 
remote data access This approach minimizes software and memory 
overhead (and coding effort) for pre-loading of remote data. 

3. The processing nodes sustain high performance on the execution of the 
arithmetic and control operations which are relevant for the codes (or 
at least their basic kernels) of fuU-QCD algorithms, in particular double 
precision floating point arithmetics, memory access to field variables of 
composed data structures, local and global program-flow control, etc. 

To obtain a good floating-point efficiency for the execution of a given com- 
putation, the compute power and memory bandwidth should be balanced 
accordingly. This balance is usually measured in term of the parameter R, 
defined as the ratio between the number of floating-point operations and 
the corresponding memory accesses (in the corresponding data format). 
A processor is balanced for a given algorithm if the R value required by 
the algorithm is roughly equal to the R value allowed by the processor 
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itself. In the case of the Dkac operator, which usuaUy dominates the cost 
in LQCD computations, a typical value is i? ~ 4. 

4. Memory size, disk space and disk-bandwidth match each other and are well 
suited to the problems we want to study. This means that all compute 
intensive kernels must not be slowed significantly because required data is 
not available in main memory. We must keep all data in physical memory 
as long as possible. In all cases in which this is not possible (e.g., for light- 
fermion propagators on very large lattices) we must be able to temporarily 
store on (and retrieve from) disk with large enough bandwidth. 

These requirements shape the global architecture of the machine: 

1. We consider architectures based on three dimensional grids of processors, 
with nearest neighbour data- links. Reasonable sizes of the mesh of pro- 
cessors that will be used for the simulation of large lattices are somewhere 
in the range 8'^ • • • 12'^ • • • 16'^ nodes, where a physical lattice of 48'^ x 96 
points can be readily mapped. For finite size analysis on small lattices, a 
mesh of 4'^ • • • 6'^ processors may be considered. 

The size of the processor mesh dictates a lower bound on the communica- 
tion bandwidth between neighbouring processors. We define by p the ratio 
of local memory accesses (transfers between processor and its memory) 
over remote memory accesses (transfers between neighbour processors), 
which depends on the lattice size and the algorithm. Under the assump- 
tion of balanced local bandwidth (i.e., processors are able to access enough 
data in local memory to sustain their potential performance, see later for 
details), effective bandwidth^ for remote communications must not be 
lower than 1/p times the local bandwidth. Estimates of the required ratio 
for a naive implementation of the Dirac operator using Wilson fermions 
are given in table |^ for a sub-lattice of n\ x Nt physical points and local 
time direction per processor (note that, to first approximation, p ~ 2?!^). 

A nice and simple trick can be used in the computation of the Dirac 
operator to reduce the number of remote accesses. For the negative di- 
rections the Dirac operator involves terms of the type Ujj^{x — ti)tp{x — /x) 
where the fermion term ip and the corresponding gauge matrix (U) must 
be fetched from the same place. We can therefore evaluate the product 
Ufj,{x — fJ-)ipix — /i) on the remote node and transfer the result only. In 
brief, all remote accesses involving gauge fields disappear. Table || contains 
the p values corresponding to the evaluation of the Dirac operator using 
the above mentioned technique. We consider the comfortably increased 
p values as an useful safety margin, that could be exploited to increase 

^ including tlie effect of the start-up latency for typical packet lengths. 
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Linear lattice size 


p 


3^ 


5.8 


43 


7.8 


63 


11.6 


83 


15.5 



Table 1: Local vs. remote memory access patterns: p is the ratio of memory 
accesses to local memory over memory accesses to neighbour nodes in a simple 
implementation of the solver for the Dirac operator, p is estimated as a function 
of the linear size of the sub-lattice mapped onto each processor. 

the floating point performance of each node, at fixed remote bandwidth. 
Clearly the actual values of p which can be accepted must be studied more 



Linear lattice size 


P 


3^ 


7.5 


43 


10 


63 


15 


83 


20 


3x6x6 


11.25 


.S X 4 X 4 


9 



Table 2: Local vs remote memory accesses: this table is the same as the previous 
one, except that p is estimated taking into account the trick, described in the 
text, that reduces remote accesses. The last two entries refer to non-square 
sub-lattices that might be used when simulating a lattice of spatial size 48^ on 
large machines with 16 x 8 x 8 or 16 x 12 x 12 nodes. 

carefully (possibly simulating architectural details of the mechanisms that 
hide remote communications) 

2. To discuss memory-size requirements in more details, one has to distin- 
guish between the case of full QCD simulations and calculations in the 
quenched approximation. 

In full QCD simulations, by far the largest amount of time is spent in 
the updating process. In this case, on-line memory has to be large enough 
to allow for the implementation of efficient algorithms. State-of-the-art 
update algorithms need a large number of auxiliary fields on each lattice 
site. We use as unity the amount of memory associated to one fermion 
field (24 data words, corresponding to 192 Bytes in double precision. We 
call this quantity a fermion equivalent - feq - in the following). A generous 
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gauge fields 


72 W 


3 feq 




fermion propagator 


288 W 


12 feq 




(pseudo-) fermion field 


24 W 


1 feq 




Pauli term for improvement 


72 W 


3 feq 



Table 3: Data structures used in LQCD and corresponding memory require- 
ments (in words and fermion equivalent storage) per lattice point. Greek indices 
run from 1 to 4 and Latin indices from 1 to 3. The first three entries are general 
complex matrices, while the Pauli term is hermitian: (cr • -F')"f = [{cr ■ F)^a]* ■ 

estimate, leaving space for more sophisticated, presumably more memory 
intensive algorithms, is about ~ 200 feq per site. 

On the other hand, in the case of the quenched approximation, the up- 
dating process may be neglected for both computing power and memory 
requirements (less than 10 feq per lattice site are needed). Instead, we have 
to consider the memory requirement originating from the measurement of 
a heavy-light form-factor. The database needed for such a calculation 
consists of one gauge field configuration, one Pauli term, Ni -\- Nh fermion 
propagators {Nh and Ni are the numbers of heavy and light fermions 
respectively), each replicated for the number of momenta and operator 
insertions used and for each lattice site (typical cases, being Nh = Ni = 4, 
3 momenta and one operator insertion). Quenched QCD will be used es- 
sentially for heavy quark phenomenology. Here the real problem is the 
extrapolation to the b quark mass. To be safe one should have a physical 
cutoff much larger than the masses that enter the simulation. Then large 
lattices, of the order of 100^, are necessary. 

We summarize our memory requirements in table |^ (where the size of the 
relevant data structures are presented) and in table ^, where actual mem- 
ory sizes are collected, under the assumptions of using double precision^ 
throughout. From the first two lines of table ^, we see that we cannot ex- 
pect to keep the whole data-base in physical memory when large lattices 
are considered. However, if only two propagators at the time are kept 
in memory, for ease of programming, while the others are either recalcu- 
lated (the heavy ones) or stored and reloaded from disk (the light ones), 
memory requirements reduce sharply (third line in the table). 

We conclude that, by judiciously swapping data to disks, a memory size 
of the order of > 1 TByte is a good compromise for both our case studies. 
Alternatively, one might consider two memory options: a small memory 

■^The necessity of double precision arithmetic in full QCD has been investigated in the 
literature |7l and will not be discussed here. 
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machine (~ 500 GByte) for full QCD and a large memory version (1-^2 
TByte) for quenched studies. 



Case 


updating 


measurement 


small lattice, full QCD 
large lattice, quenched QCD 
large lattice + disk 


400 G 
200 G 
200 G 


1.4 T 
13 T 
1.8 T 



Table 4: Total memory requirements for the case studies discussed in the text. 
The line labeled + disk refers to the case in which two propagators only are 
kept in memory (all others being swapped onto disk or recomputed). 

3. Fast input-output is mandatory, as obvious from the previous point, for 
studies on large lattices. As a rule of thumb, we may want to load or 
store one {large lattice) propagator (~ 250 GBytes) in little more than one 
minute. This requires a global bandwidth of the order of 2 — 3 GBytes/sec. 

For full QCD permanent storage of the configurations is required due to 
the computing effort needed to generate them. This is a storage-density 
(as opposed to bandwidth) problem which is independent of the machine 
architecture and should be discussed in a different context, with potential 
links with the GRID [|j project, likely to be supported by the European 
Commission. In the case of large lattices in quenched QCD the strategy 
of computing on the fly without saving configurations is the best. Only 
the final correlation functions are saved and this means at most a few tens 
of MBytes per configuration. 

Processing performance is strictly speaking not a clear-cut requirement: the 
more is available, the better. We can estimate how much is enough, however, 
by extrapolating the present state of the art. A sustained performance of 300 
GFlops (with perhaps 40% efficiency) is now heavily used for full QCD simu- 
lations on lattices of size 24"^ x 48 0. If we assume a critical slowing down 
where computer time grows like a~ [p^, we would like to have a sustained 
performance two orders of magnitude higher if we want to halve a. 

An ambitious target for our project is therefore a total installed performance 
in the order of 10 30 TFlops. From the point of view of physics requirements, 
it is not important that this computing power be sustained on a single system. 
Several smaller machines can perform equally well (or perhaps better), as long 
as each of them is able to handle large enough lattices. 

Also, we must envisage the operation of some lower performance (and cor- 
respondingly smaller memory) machines, where small lattices are handled and 



11 



algorithms, programs and physical parameters are tuned before a large calcula- 
tion is moved onto a large production machine. 

3 The APEmille Project 

In this section we briefly review APEmille. 

APEmille is the present generation APE project. It is based on the standard 
structure of a large array of processing nodes arranged at the edges of a three 
dimensional mesh and operating in SIMD mode. 

At present (November 2000), several medium-size installations are up and 
running, while several larger units are under construction (see table |^). Consid- 
ering all large and small machines, the integrated peak performance available at 
the end of the year 2000 will be about 1 TFlops at INFN and about 400 Gflops 
at DESY. The largest single system will have a peak performance of 250 (possi- 
bly 500 Gflops). Other institutions in Europe are procuring (or considering to 
procure) APEmille machines. 



Site 


peak performance 


status 


Rome 


260 Gflops 


running 


Zcuthen 


130 Gflops 


running 


Rome II 


65 Gflops 


running 


Bielefeld 


80 Gflops 


running 


Milano/Parma 


130 GFlops 


planned, Dec. 2000 


Pisa 


130 Gflops 


planned, Dec. 2000 


Rome II 


+ 65 Gflops 


planned, Dec. 2000 


Rome 


+ 260 Gflops 


planned, Dec. 2000 


Zcuthen 


+ 260 Gflops 


planned. Spring 2001 



Table 5: A short list of some large APEmille existing installations and of plans 
for the near future. 

In a typical critical LQCD kernel (a solver for the Dirac operator) coded in 
the high level TAO programming language, measured sustained performance in 
single precision is about 44% of peak performance and in double precision it 
is about 19% of peak single precision (i.e. 80% of peak performance in double 
precision). Higher efhciency can be obtained with more careful programming: 
we have pushed single precision performances up to about 58% of peak speed 
writing the key portions of the Dirac solver in assembly. 

In a later section, we will claim that an architecture a la APEmille continues 
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to be a very good choice for LQCD computing. We see however a number of 
problems in APEmille, all pointing to the development of a new generation 
system: 

• Peak Performance APEmille machines can be made larger than we plan 
to build, but not too large. The largest system that can be assembled with 
the present hardware building blocks is a configuration of 8 x 8 x 32 nodes, 
corresponding to 1 TFlops peak performance (APEmille systems can be 
configured in principle as 8 x 8 x 2n arrays). Still larger systems would need 
some minor hardware development and would probably be not convenient 
in LQCD, since they have an unordinately large number of nodes along 
one dimension. 

• Memory Size The very large APEmille machine described above has 64 
GBytes memory. This is still several times lower than discussed in the 

section on requirements. 

• Floating-point precision APEmille is basically a single precision ma- 
chine (performance decreases by factors from 2 to four in double preci- 
sion). However, in future large LQCD simulations double precision will 

be necessary in an increasing number of cases. 

• Little space for improvements APEmille is architecturally very sim- 
ple, since it relies on accurate and rigid hardware synchronization. This 

style of synchronization is difficult to support if the system clock is in- 
creased significantly. For this reason, we see little space for incremental 
improvements in performance. 



4 A Review of Similar Projects 

In this section, we gather some information on similar projects, carried out 
by other groups. To the best of our knowledge, the following activities are in 
progress: 

• CP-PACS 

The CP-PACS collaboration have made a feasibility study of a future 
project which follows the CP-PACS project. Extrapolating the data of 
the performance obtained in recent full QCD simulations on the CP-PACS 
computer, they have estimated the computer time required for a large- 
scale full QCD calculation, with the quality of data comparable to that 
of the present quenched QCD study on the CP-PACS. They assume that 
lattice action and the simulation algorithm are identical to the present 
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simulation on the CP-PACS. Their estimate, ~ 100 TFlops -year ||Tl|, 
is somewhat larger than the one of the ECFA panel. In addition to their 
feasibility study, CP-PACS are carrying out basic research on the following 
two topics, that they consider as very important technologies for the next- 
generation of massively parallel computers: 

— Development of an architecture of high-performance memory-inte- 
grated processor for the next generation massively parallel comput- 
ers. 

— Establishment of a model of parallel 1/0, parallel visualization and 
man-machine interface, which can process efficiently and flexibly the 
enormous amount of data generated by massively parallel computers. 

CP-PACS hope they will have a chance to develop a next-generation com- 
puter using the results of their basic research in the near future, but they 
do not have a project at the present time. 

• Columbia The Columbia group have officially embarked on the design 
and construction of their next machine |lj] . The design effort is still on a 
fairly high level with choice of processor and communications technology 
being the first questions that have been resolved. Most significant is the 
choice of microprocessor, which is provided by an IBM PowerPC core. 
This follows from an arrangement with IBM that permits to exploit pro- 
prietary technology to construct a full processing node (memory included) 
on a single chip. This feature provides the name to the new project QCD 
on a Chip (QCDOC). The node will contain a PowerPC 440 core, one 
64-bit, 1 Gflops FPU (an integrated part of the PowerPC architecture), 
4 MBytes of embedded DRAM and 8 bi-directional serial inter-processor 
links, each operating at 0.5 Gigabits/sec. If they are able to achieve this 
frequency, this would give a total off-node communications bandwidth of 
1 GBytes/sec. 

The group is now busy to determine the other details of the project and 
begin the detailed design of the node. 

We also include an arbitrary selection of two (out of the many) interesting 
examples of PC-based cluster architectures for comparison. 

• The Wuppertal Cluster ALICE 

The "Institut fiir Angewandte Informatik" at Wuppertal University has 
installed the flrst half of the Alpha-Linux-Cluster-Engine (ALICE) in 1999. 
When the system is fully installed, in May 2000, it will consist of 128 DSIO 
uni-processor workstations connected by a Myrinet multi-stage crossbar 
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switch. All CPUs will be upgraded to 600 MHz Alpha 21264 EV67 chips 
with 2 MBytes second level off-chip cache [ p^ . 

The cluster is intended to perform efficiently in several HPC application 
profiles at the University of Wuppertal, including computational chem- 
istry, electrical engineering, scientific computing and simulations of quan- 
tum field theories. 

Of particular interest is the operability of this self-made system in a Uni- 
versity's multi-user environment. In computer lab courses, the emphasis 
is on " Physics by High Performance Computers" . Several student groups 
use the system simultaneously in interactive mode much alike a desk-top 
system. 

A forward looking ALiCE-project, to be carried out together with the de- 
velopers of the ParaStation communication software from Karlsruhe uni- 
versity, deals with optimization of efficiency and data organization for AL- 
ICE under real life conditions, in particular with the goal to make parallel 
I/O and file system functionalities available. 

• PMS, The Poor Man's Supercomputer A PC cluster has also been 
developed at Eotvos University in Budapest [|l5j . The current version of 
PMS has 32 PC's. Contrary to the previous example, the PMS project 
has developed QCD-optimized communication hardware. They use dedi- 
cated add-on boards to establish physical communications between nearest 
neighbour PC's in a three dimensional array. The actual configuration of 
32 PC's can be imagined as a 2 x 4 x 4 mesh of processors. The system 
uses a standard Linux operating system and the favoured programming 
style is the well tested SIMD paradigm. 

The present version of PMS is shaped by the requirement to reduce costs 
as much as possible. Indeed, PMS uses cheap AMD K6-2 processors (de- 
livering only 225 MFlops each) while the special purpose communication 
interface has a bandwidth of just 2 MBytes/sec. We consider the PMS as 
a very good trade-off between the advantages offered by the use of general 
purpose systems and the performance boost that dedicated hardware is 
able to provide. 

• The MIT-Jefferson Lab project This project is organized in two 
phases. Phase I has been submitted to DOE in march 1999 for a 256- 
processor cluster at JLab and a 64-processor cluster at MIT. The building 
block is a 4-processor Compaq SMP node with a 750 MHz Alpha 21264 
chip, 4 MB cache/processor and 1 Gbyte memory/ SMP. The communica- 
tion network is based on Myrinet switches. They plan to achieve a cost per 
sustained MFlops lower than $10. The collaboration currently operates a 
12 node prototype cluster equipped with 667 MHz Alpha 21264 at MIT. 
They plan to complete phase I by 2001. The collaboration has a quite 
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intensive physics program mainly devoted to nucleon structure. Phase II 
of the project, not approved yet, foresees, in the years 2002-2005, multi- 
Teraflop machines from a combination of QCDOC and clusters. They 
are collaborating with Compaq to explore resources available at that time 
(EV7 and EV8). 



5 Technological Scenarios 

In this section we discuss forecasts about the state of the art for several enabling 
technologies in the years 2001-2002. We cover the following points: 

1. basic digital VLSI technology. 

2. memory technology. 

3. data- links. 

4. Off-the-shelf processors. 

5. The Crusoe architecture. 

We conclude the section with a discussion of the architectural implications of 
the technology-driven choice of overcoming the strictly synchronous operation 
of APEIOO and APEmiUe. 

5.1 VLSI Technology 

APEmiUe is based on a chip-set designed with a 0.5^ digital CMOS technology. 
A second source for the chip-set has been established, using a more advanced 
0.35/j, technology. In the next few years, 0.25/i and 0.18/x CMOS technologies 
will be readily available. 

A comparison of some key features of the silicon technologies used in APE- 
miUe and of a representative of both 0.25/i and 0.18/i technologies is made in 
table ^ 

The figures quoted in the table refer to processes that are (or will be) readily 
available through the same European silicon broker that helped us develop the 
second source of the APEmiUe chip set. 

Let us consider a scaled version of the APEmiUe processor. If we use a 0.18/i 
process, it should be easy to reach a clock speed between three to five times 
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Feature 


ES2 O-S^i 


Alcatel 0.35^ 


UMC 0.25/i 


UMC 0.18/i 


Vdd 

Gate delay 
Gate density 
Memory (IP) 
Memory (2P) 
Power / gate 


3.3 V 
180 ps 

10 K/mm^ 

11 Kb/mm^ 
6 Kb/mm^ 

O.S^iW/MHz 


3.3 V 
100 ps 
20 K/mm2 
25 Kb/mm2 
8 Kb/mm^ 
0.4/iW/MHz 


2.5 V 

75 ps 
45 K/mm^ 
44 Kb/mm^ 
16 Kb/mm^ 
0.2^W/MHz 


1.8 V 

36 ps 
90 K/mm^ 
85 Kb/mm2 
30 Kb/mm2 
0.1/iW/Mhz 



Table 6: A summary of some key parameters for digital silicon technologies used 
in APEmille and proposed for apeNEXT. All values are directly obtained from 
the relevant silicon foundries, except for the bit density of 1 Port or 2 Ports 
memory arrays in the UMC technology. The latter are based on conservatively 
applied scaling rules 

higher than in APEmille, while we may expect to squeeze up to 9 times more 
transistors onto the same silicon area. We can stay on the safe side planning to 
use a clock frequency of 200 MHz. An LQCD optimized processor running at this 
clock frequency with one floating-point pipeline would peak at 1.6 Gflops , using 
the well known normal operation ax b + c, performed on complex operands. 
A chip three times more complex than JIOOO (and three times faster) would 
dissipate less than two times more power. 

5.2 Memory Technology 

We limit ourselves to memory systems used in future high-end PC's or low- 
end workstations. This choice (the same as APEIOO and APEmille) should be 
the most effective to provide the highest level of integration, reduce costs and 
guarantee part availability. 

In the near future, planned memory systems are either RAMBUS DRAM's 
or DDR SDRAM 'sf] 

The DDR SDRAM (Double Data Rate Synchronous DRAM), is the evo- 
lution of the mature SDRAM (Synchronous DRAM) technology (widely used 
in the APEmille machine). The SDRAM is a low latency burst oriented de- 
vice made of multiple (2 to 4) banks of asynchronous DRAM controlled by a 
synchronous controller which allows pipelining of the I/O interface (one word 
is accessed for every clock cycle). The Double Data Rate architecture realizes 

^In the following we do not distinguish between traditional DDR SDRAM and new 
"flavour" DDR SDRAM like Sync-Link because both are evolutionary designs of the same 
basic structure. 
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two data transfers per clock cycle using both edges of the clock and one special 
reference signal to fetch corresponding data. 

The RAMBUS is a more advance memory architecture which works as a chip- 
to-chip system-level interface rather than a conventional memory device. The 
RAMBUS RDRAM (which stands for RAMBUS Direct Dram) shares the same 
architectural idea of the SDRAM one, a core asynchronous plus a synchronous 
controller. It makes use of a large degree of parallelism (32 interleaved memory 
banks) on a narrow internal bus. The RAMBUS RDRAM is based on the 
Direct RAMBUS Channel, a high speed 16-bit bus at a clock rate of 400 MHz, 
which thanks to the adoption of a dedicated signalling technology (RAMBUS 
Signalling Level) allows 600 MHz to 800 MHz data transfers. 

In table |^ we summarize the main features of the two technologies, for cur- 
rently available and next generation (less than 2 years from now) chips. 





DDR 


RDRAM 


DDR 


RDRAM 


Data rate 


200 MHz 


800 MHz 


400 MHz 


800 MHz 


Memory size 


256 Mbit 


128/144 Mbit 


1 Gbit 


256 Mbit 


Organization 


x4,x8,xl6 


xl6,xl8 


xl6,x32 


xl6,xl8 


Peak bandwidth 


0.4 GB/s (xl6) 


1.6 GB/s 


1.6 GB/s (x32) 


1.6 GB/s 


Package 


TSOP(66) 


BGA 


TSOP(80) 


BGA 


Power (VCC) 


2.5 V 


2.5 V 


1.8/2.5 V 


1.8 V 


I/O type 


SSTL2 


RSL 


SSTL (?) 


RSL (?) 


Power cons. 


80 mA 


330 mA 


7 


? 


Cost (norm.) 


1.0 


1.8 


? 


? 


Sample/Prod. 


Now /Now 


Now/ Now 


3Q99/4Q00 


? 



Table 7: A summary of several important figures for two options of dynamic 
RAM's. The second and third columns refer to presently available DDR and 
RAMBUS devices. The fourth and fifth columns refer to the expected evolution 
of these devices in the next two years. 

Some comments are in order: 

• The simple architecture of the DDR SDRAM allows larger memory size 
per device. For a given fixed amount of memory, this reduces the number 
of used components. 

• Since power consumption is proportional to the interface clock (a factor 4 
between RAMBUS e DDR), aggregated memory systems using the DDR 
SDRAM reduce the global consumption. 
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• On the other hand the extremely high peak bandwidth of the RAMBUS 
allows to build a very fast memory system with minimum impact on board 
space occupancy (compact BGA packaging). 

• The logic complexity of a RAMBUS interface is much larger than for a 
DDRAM controller (the latter could be easily designed on the basis of the 
experience done in the realization of the APEmille memory controller). 
On the other hand, several silicon foundries make a RAMBUS controller 
available as a core cell. 

We conclude this section by presenting in table || two possible DDRAM-based 
memory systems for apeNEXT. The performance target is set by our basic 
performance figure, discussed in the previous subsection of 1.6G flops and R = 
4, leading to a bandwidth requirements of at least 3.2 GBytes/sec (assuming 
double precision data words throughout). 



chip-size 


1 Gbit 


1 Gbit 


chip organization 


32 bits 


32 bits 


chip number 


4 


2 


word size 


128 bit 


64 bit 


bank size 


512 MBytes 


256 MBytes 


frequency 


300 MHz 


400 MHz 


total bandwidth 


4.8 GBytes/sec 


3.2 GBytes/sec 


power consumption 


640 mW 


400 mW 



Table 8: Basic features of two possible memory systems for apeNEXT based 
on DDRAM memory technology. Power consumption is estimated by rescaling 
data available for present generation systems 

In conclusion, forthcoming memory technology is adequate to support the 
processor performance discussed above. There is in fact reasonable space to 
consider either fatter node processors, or multi-processor chips. 

5.3 Data-link Technology 

We now consider remote communications which, in our opinion, is a key tech- 
nological challenge for the project. 

Assuming our reference figures - 1.6 Gflops per node, along with i? = 4, and 
p — 8 (as defined in the previous sections) - we require an inter-processor com- 
munication bandwidth of about 400 MBytes/sec. As discussed above, several 
code optimization steps are able to reduce the amount of data to be transferred. 
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The overlap between computation and communication can also be increased. All 
this steps reduce bandwidth requirements. We will stick however to the previous 
figure, so a large safety margin is established. 

The needed communication patterns arc however very simple: communica- 
tions are needed between nearest-neighbours (L-shaped paths, between next- 
to-nearest neighbours are also useful) in a 3-d array of processors, where each 
processor has 6 direct links to its nearest neighbours. The real challenge in this 
area is therefore more the implementation of a fast, reliable and cheap link than 
the development of any clever routing strategy. 

In APEIOO and APEmille, links use large, parallel and synchronous data 
paths. Data words are injected at the transmitting end of the line following a 
rising transition of the clock and arc strobed into the receiving end of the line 
at the next rising edge of the clock. This works if 



where Tt is the travel time over the physical link, Tdock is the clock period and 
^Tciock is the phase spread between (nominally aligned) clock signals at various 
places in the machine. The conditions are met in APEmille, where Tdock = 
30ns, Tt ^ 10ns and ATdock — '^ns, while they become clearly unrealistic for 
frequencies of ~ 200 MHz. 

More advanced (high bandwidth) link technologies have recently become 
available, in which data and timing information are both encoded on the physical 
link, so asynchronous operation is possible. In the bandwidth range relevant for 
us, we have considered three different options: 

• Myrinet-like links. The physical layer of the Myrinet interconnect uses 
low swing single-ended signalling. One byte is encoded onto ten signal 
lines, carrying also timing information. The full duplex link uses two 
such busses. The present generation Myrinet link has a bandwidth of 160 
MBytes/sec (using both edges of an 80 MHz clock), while a new genera- 
tion (Myrinet-2000, 320 MBytes/sec) is under test. The main advantage 
of Myrinet links is that they pack a lot of bandwidth while keeping oper- 
ating frequency low. Board layout details, connectors and cables are also 
very well tested. We are informally discussing with Myricom the possi- 
bility to use this link for apeNEXT. Myricom have agreed to allow us 
to use the link level (SAN-port) circuitry for their latest Myrinet chips 
(Myrinet 2000) as a basis for the apeNEXT links. Under a suitable 
non-redistribution agreement, Myricom will make available to the collab- 
oration the layout of the basic cells, along with their Verilog models. 



Tt < Td ock 

ock 



(1) 

(2) 



20 



• LVDS based links. The Low Voltage Differential Signalling (LVDS) 
technology is now widely used in many telecom and network technologies, 
like the Scalable Coherent Interface (SCI). LVDS is designed to work up to 
~ 622 MHz. Several redundant encoding schemes (e.g., 8 bits into 10 bits) 
have been proposed. LVDS cells are readily available from several silicon 
vendors. New generations FPGA chips have been announced including 
LVDS options. Work is in progress to test LVDS links, as described later 
in this document. 

• High speed proprietary links. Several silicon houses (e.g., Texas In- 
struments (TI), National Semiconductor(NSC), LSI Logic) have devel- 
oped very high speed proprietary links, aimed at the Gb Ethernet, Fiber- 
Channel, Infini-Band markets. The typical bandwidth is higher than 1 
Gbit/s. Complete encoding-decoding black-boxes are usually available. 
This option has two main drawbacks: it makes the whole project depen- 
dent on a specific silicon house, and requires very careful layout of the 
printed circuits and proper choices of cables, connectors and the like. 



Basic figures of the three options are compared in table g, where we use for 
the LVDS case a more conservative frequency of 400 MHz. 



Technology 


Frequency 


Pins 


Bandwidth 


Power Dissipation 


Myrinet 


160 MHz 


20 


320 MBytcs/s 


300 mW 


LVDS 


400 MHz 


40 


400 MBytes/s 


200 mW 


ScrDcs(TI) 


1.24 GHz 


10 


400 MBytes/s 


400 mW 


NSC DS90CR483/484 


784 MHz 


18 


672 MBytes/s 


1500 mW 



Table 9: Basic figures for several link technologies. All figures refer to full duplex 
links. An 8 bit into 10 bit encoding has been assumed for the LVDS case. 

An important issue is the reliability of the network, usually measured in 
BERR (average number of errors for transmitted bit). If we require fault-less 
operation of a large machine for one day (say, 2000 links active for 50% of the 
time), we need a very low value of BERR ~ 10"". For comparison's sake, 
measured stable operation of an APEmille machine with 250 nodes for periods 
of a few days implies BERR < 10"^^. 

Machine reliability greatly improves if the network is able to recover from 
network errors by retrying a failed communication (this impacts on link latency, 
but the impact can be made low with some care). For instance a comfortable 
BERR ~ 10~^^ implies that one communication must be retried on the machine 
every second. 
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The above discussed feature however requires some degree of non asyn- 
chronous operations, with important technological implications. Regardless of 
the technological choice made for the processor, we think that no real advan- 
tage is gained by departing from the Single Instruction Multiple Data (SIMD) 
or Single Program Multiple Data (SPMD) programming style used in previ- 
ous generation APE machines. At the hardware level, APE processors of all 
previous generations have been hardware-synchronized with an accuracy of a 
fraction of clock cycle. Although logically very neat, this is rapidly becoming 
impossible, for clock frequencies higher than 100 MHz and across physical scales 
of several meters. We consider an approach in which independent processors, 
while running at the same frequency, are only loosely synchronized. Logical 
synchronization will have to be enforced by some form of software-controlled 
barrier. 

5.4 Off-the-Shelf Processors 

In this section, we briefly consider of-the-shelf processors as a potential building 
block for the computational core of apeNEXT. With one notable exception 
(see later), we choose to consider only the option of using commercially available 
boards (in other word, if we decide to use a commercially available option, we 
want to drop altogether any hardware development not involving the network). 
In the following section we will compare the relative merits of off-the-shelf versus 
custom processors. 

Standard off-the-shelf processors have increased in performance by more than 
one order of magnitude in the last 8-10 years, with an even more remarkable im- 
provement in the efhciency of floating point computations. Standard PC boards 
using off-the-shelf processors have been used for small scale LQCD simulations. 
The relevant codes are written in familiar programming languages, like C (or 
CH — h) or Fortran. Efficiencies are limited by bottlenecks in memory access as 
soon as the data base involved in the computation exceeds the cache size (which 
is the typical situation in realistic LQCD simulations). Measured efhciencies on 
a Pentium II processor running at 450 MHz arc of the order of 30%, for real-life 
production programs (running on just one node, i.e., with no communication 
overheads) fl^ . 

A detailed discussion of the expected technical road-maps for off-the-shelf 
processors in the next few years in general terms would exceed the scope of this 
document. Instead, we discuss the features of a typical high end microprocessor, 
that might be used today and apply usual scaling laws. For this purpose, we 
(rather arbitrarily) take the AMD Athlon. A number of features relevant for 
LQCD simulations are shown in table 
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Clock frequency 


600 - 1000 MHz 


FP ops (single precision) 


4 per clock cycle 


FP ops (double precision) 


1.6 per clock cycle 


FP latency 


15 clock cycles 


LI Data Cache 


64 kBytes 


Data bandwidth to L2 cache 


1.6 GBytes/sec 


Sustained LQCD performance 


360 MFlops 


Power consumption (750 MHz) 


35 W 


Retail price (600 MHz) 


200 Euro 


Retail price (750 MHz) 


375 Euro 


Retail price (800 MHz) 


500 Euro 



Table 10: Features of the AMD Athlon processor relevant for LQCD simulations. 
Sustained performance is estimated under the assumptions discussed in the text. 



In the rest of the discussion, we consider the version of the Athlon running 
at 750 MHz. Indeed, figure ^ shows that for higher frequency, power dissipation 
increases faster than linearly. 

If we assume an efficiency comparable to the one measured on Pentium sys- 
tems, we expect a sustained LQCD performance of ~ 360M flops per processor, 
making it possible to use dual-processor mother-boards without jeopardizing ef- 
ficiency (a quad-processor system would saturate the maximal theoretical band- 
width of 1.6 GBytes/sec to access a memory bank working at 200 MHz assuming 
our usual value of i? ~ 4) . 

In summary, a high end PC-like node should be able to sustain a performance 
of ~ 700M flops running LQCD codes in double precision. We can take this as 
our basic building block, with just a few relevant figures summarized in table 11 



Sustained performance 


700 MFlops 


Power dissipation 


90 W 


Tag price 


1500 Euro 



Table 11: Basic figures for a PC-based node of an LQCD engine, using cur- 
rently available off-the-shelf hardware. Price estimates are made at current 
retail prices. They include 512 MBytes main memory. No LQCD networking 
or infrastructure is considered. 

This nodes needs a sustained interface to neighbour nodes in the three di- 
rections of the lattice grid with a bandwidth of ~ 200 MBytes/sec. 
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Figure 1: Power consumption (W) of the AMD Athlon processor as a fmiction 
of the clock frequency (MHz) |17| . 



In conclusion, a system delivering 1 TFlops sustained LQCD performance 
would cost more than 2.2 MEuro in processors only, and dissipate more than 130 
KW power. We will discuss the implications of these numbers in the following 
section. 



5.5 The Crusoe Architecture 

Very recently a new processor architecture (known as the Crusoe) has been pro- 
posed by Transmeta Corporation. The Crusoe is advertised streamlined 
(hence very low consumption) processor, optimized for laptops or other mobile 
computers. The Crusoe has a very simple architecture, that, when used behind 
a core-level software environment, emulates the Intel X86 architecture. From 
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our point of view, it is more relevant that the architecture of the Crusoe is 
extremely similar to the combination of the processing chips used in APEmille. 
Basically, the Crusoe core is a micro-coded system in which several functional 
units operate concurrently on data coming from a medium-size register file (see 
figure ||). The chip has also a data instruction cache, as well as two different 
memory interfaces. 



Micro-code Word 



FADD 


ADD 


LD 


BRCC 



FP unit 



Integer ALU 



Load/Store 



Branch 



Figure 2: The Crusoe architecture (adapted from |18 



A high-end implementation of the Crusoe (advertised as available from Sum- 
mer 2000) is called the TM5400. It runs at 500 (maybe 700) MHz and dissipates 
about 2.5 W, when running at full speed. 

At present, no Crusoe-based boards are available. It is likely that the first 
commercial products using Crusoe processors will be laptop machines, that 
obviously do not meet our requirements. We have therefore to consider the 
option of building a Crusoe-based apeNEXT processing board. 

The main advantages of this choice are basically summarized by saying that 
we would be using an architecture very similar to APE, while being spared the 
burden of designing our own processor. 



We have contacted Transmeta to explore this option. They stated that: 



• They are not ready to provide critical details of the internal architecture 
(for instance, no information was provided on how many floating point 
operations can be executed at each clock cycle). 

• Sufficient details of the VLIW core will not be given. Indeed Trans-Meta 
attitude is that all programming for the Crusoe must be done at the level 
of the Intel architecture, and must be translated with their proprietary 
software. 
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With these pieces of information available and considering also that: 

• It is not clear whether chips can be procured at an early enough stage of 
the project. 

• It is not obvious how fragile the whole Crusoe endeavour is. 

we think that the present situation does not suggest to base a new project on 
Crusoe. Of course, we will keep a close watch on any related development. 



6 Custom or Off-the-Shelf Processor 

Previous generation LQCD projects have used either custom processors, or sub- 
stantial enhancements to standard processor architectures or processors devel- 
oped for niche applications. No big project has been based on standard off- 
the-shelf processors so far. Today, a decision to follow the same path is not as 
obvious as it has been in the past, since off-the-shelf processors have increased 
in performance by more than one order of magnitude in the last 8-10 years, with 
a remarkable and even more relevant improvement in the efficiency of floating 
point computations. 

In table [ij we compare a few numbers relevant for APEmille, for the PC- 
based solution discussed in the previous section and for a custom-based ape- 
NEXT architecture (in this case, we use several tentative numbers discussed 
in early sections). 





APEmille 


apeNEXT: PC-based 


apeNEXT: custom 


Peak performance 
Sust. performance 
Power Dissipation 


500 MFlops 
250 MFlops 
1.5 W 


1200 MFlops 
360 MFlops 
35 W 


1600 MFlops 
800 MFlops 
3.0 W 



Table 12: Comparison of several key figures for APEmille processors and possi- 
ble options for apeNEXT. We assume that a next generation custom processor 
has the same efficiency as APEmille. 

No clear cut best choice emerges from these numbers. In general, we see 
advantages both in custom architectures and in PC-based architectures: 

We believe that a custom architecture is superior for very large (> bOOnodes ) 
systems for the following reasons: 
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• lower power consumption by one order of magnitude. 

• significantly more compact mechanical design. 

• better scalability once the basic units are operating (reliability and soft- 
ware issues of large systems). 

• easier interfacing with the necessary custom remote communication net- 
work and the host system. 

• better control of technological aspects and less dependence on changing 
commercial trends during the realization of the project. 

On the other hand, we see several advantages stemming from the use of 
PC-derived systems for smaller machines: 

• limited hardware development effort. 

• standard software is readily available for major parts of the compiler and 
the operating system. 

• short lead time to commission a prototype system. 

We see at this point the need to make a clear decision between the two 
options: we decide to focus on the development of a LQCD architecture based 
on an APE- like custom processing nodes, whose architecture is described in the 
next sections. We base our decision on the following points: 

• we want to focus our project onto machines with very large performance. 

As explained earlier on, we will have to put together several machines to 
really arrive at a VERY LARGE scale. 

• we think to be able to rescale and reuse a large wealth of building blocks 
from APEmille, reducing the design time. 

• We think that the commissioning of a very large PC-based system (involv- 
ing several thousand PC's all over the collaboration) is a huge (and new 
for us) project in terms of hardware (thermal and power management, 
availability in case of hardware failures) and software (control of a large 
network) issues for which we have no real background. 

We obviously think that a PC-based system is still a viable alternative (dis- 
cussed at some length in the preliminary proposal) for small or medium-scale 
systems. At this point in time, we do not consider however the development 
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of such a PC-based cluster as a priority for the apeNEXT project. We are 
however wiUing to collaborate with any such project, making any apeNEXT- 

propcr development that might be useful for a PC-based LQCD cluster readily 
available for such purpose. To this end, two points are most important: 

• We plan to design the network processor, supporting LQCD-optimized 
point-to-point communication in such a way that it can be easily interfaced 
to a PC (say across a PCI interface). See the section on the network 
architecture for more details on this point. 

• We start from the beginning the development of a programming envi- 
ronment that allows easy porting between PC-clusters and apeNEXT 
systems. 



7 Architecture Outline 

In this section, we outline an architecture, leading to stand-alone apeNEXT 
systems scalable from about 100 Gflops to about 6 TFlops peak performance. 

Just one such high-end machine would offer a ten-fold increase in peak per- 
formance with respect to currently available systems. Several (5 to 10) high-end 
machines, working together with a comparatively larger number of low-end sys- 
tems, would allow to complete the physics program outlined in previous para- 
graphs. 

We propose the following structure: 

• a three dimensional array of processing nodes, linked together by nearest- 
neighbour links. Each node is a complete and independent processor. All 
nodes execute the same program and are loosely synchronized, i.e., they are 
started at approximately the same time and proceed at approximately the 
same pace. They synchronize when requested by the logical consistency 
of the program (e.g., before exchanging data). 

• Remote communications use FIFO-based weakly asynchronous connec- 
tions between neighbouring nodes. The SIMD/SPMD programming style 
a la APE does not require complex handshaking protocols, since trans- 
mitting nodes may assume that the receiving partner is always ready to 
receive the incoming message. 

This simple mechanism brings several architectural advantages: 

1. It allows to use for the remote communications a programming style 
which is very similar to APElOO/APEmille. The latter has the very 
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convenient feature that no explicit distinction between local and re- 
mote memory accesses is required when coding a program. 

2. This programming style can be easily modified to allow hidden data 
transfers (data are moved on the links while the processing node is 
performing calculations) . 

3. It drastically simplifies the global hardware synchronization logic of 
the system. 

• The communication interface is in principle an independent component. 
As discussed, the communication interface is conceptually based on FIFOs, 
allowing "elastic" connections between nodes. This novel feature has to 
be carefully simulated, but no serious problem is anticipated here. We 
need a fast, yet cheap and reliable^ data-link. Using p ~ 8, we need links 
of 400 MByte/sec. As discussed in the section on technology, two or three 
different solutions are available. 

As discussed in the previous section, we focus our project on an implementa- 
tion of the above outlined architecture based on a closely packed array of custom 
processors. We have in mind an implementation allowing to build systems of 
between about 1000 to about 4000 processors, along the following lines: 

• Each node is based on a VLSI processor running at about 200 MHz clock. 
The processor merges the functions of the control (TIOOO) and floating- 
point (JIOOO) processors of APEmille on a single chip. Each node has a 
private memory bank, based on commodity chips. Memory size per node 
is likely to be in the range 256 MBytes - 1 GByte per node. The actual 
choice may be heavily affected by cost factors. The basic floating point 
instruction is the complex normal operation, so peak performance is 1.6 
Gflops (double precision). As already remarked, this requires a memory 
bandwidth of 3.2 GBytes/sec {R — A). We are studying the possibility to 
increase performance by factors 2-^4, by using some form of super-scalar 
or vector processing, in which several normal operations are performed 
concurrently. 

• A typical large system has between 8x8x16 — 1024 and 16x16x16 — 4096 
nodes. We assemble nodes on processing boards, similar to APEmille. 
Each processor is more compact than in previous generations, and glue 
logic is almost completely absent. 

• The node (and the network) should support not only data transfers be- 
tween memory and register (as available on APEIOO and APEmille), but 

''As already stated, due to asynchronous operations of the machine, requirements on the bit- 
error rate of the communication system are less demanding than in previous APE generations, 
since it allows for repetition of transfers with minor performance loss. 
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also register to register. This can be used to reduce bandwidth requests 
by splitting a complex computation on more nodes, each node using local 
data as much as possible, as remarked earlier. 

• A host system similar to the one used in APEmillc is a possible choice for 
the new machine. Based on networked Linux PC's and the CPCI bus, it 
is mechanically compact and reliable. Each PC will be in charge of several 
boards. The actual number of boards connected to each PC is dictated by 
the bandwidth available on the PCI bus to move data from APE to disk 
and vice- versa. For the sake of definiteness, assume a system distributed 
on approximately 100 boards, with a total bandwidth of 2 GBytcs/scc 
(that is 20 MBytes/sec per board). In this case, up to 4 boards can 
be handled by present generation CPCI CPU's. Higher performance PCI 
busses (double size and/or double speed) may allow to increase the number 
of boards connected to each PC. The host PC's will be networked with 
the most appropriate technology available in due time. 

• We plan to take advantage of all handles offered by the non fully syn- 
chronous structure of the machine to relax the requirements and to sim- 
plify the structure of the host to apeNEXT interface. 

Basically we will hook the interface to just one or two nodes belonging 
to each apeNEXT board. (This can be done conveniently by connect- 
ing to the corresponding network interface). All complex patterns of in- 
put/output data movements, for instance relevant to a write onto disk of 
a "slice" of apeNEXT processors, are best performed by assembling the 
data words onto the input /output nodes under program control, and then 
issuing a single data transfer to disk. 

We can load executable programs in a similar way, by first moving the 
code to the input/output nodes and then having a "loader program" to 
move the data onto the whole array. 

We need a lower level system able to access all nodes independently even 
if the neighbouring nodes do not work correctly. This system is needed 

for debugging and test purposes and (for instance) to start the "loader" . 
Speed is not relevant in this case, so well tested standard systems (such 
as the JTAG interface) would perfectly fit our needs. 

• We note that it is a relatively easy task to design the (fully self-contained) 
processing node(s) in such a way that they can be connected to a standard 
PCI desk-top PC. This possibility is very appealing for program debugging 
and small scale application. We plan to pursue this design characteristic. 

In the following sections, we describe in more details some key components 
of our new system. 
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8 Architecture of the Custom Node 



In this section we present the architecture of a simple custom node for ape- 
NEXT. The main idea guiding our design has been that of reusing bits and 
pieces of APEmille as long as possible, while making use of technologic improve- 
ments to rescale performance as much as possible. We chose these guidelines in 
order to shorten the design cycle. 

The custom node on which a large scale apeNEXT system is based is called 
J&T, since it combines the functionalities provided in APEmille by the con- 
trol processor (TIOOO) and the mathematical processor (JIOOO). The combined 
processor shares just one memory bank. 

A basic block diagram of the architecture is shown in figure ^. The picture 
does not cover in detail the memory and network interface. These points will 
be discussed later on. J&T is centered around the register file, whose structure 
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Figure 3: Architectural block diagram of J&T 



is the same as the one used in APEmille. Data are transferred from memory 
to register file (and back) through a bi-directional port. Data available on the 
register file can be operated upon in just a few ways: 
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• Data words can be fed to the mathematical processor (the set of three units 
within the red dashed frame). The latter contains a floating-point data- 
path (Floating-point Building Block, FBB), an integer arithmetic unit 
(Integer Building Block, IBB) and a further unit providing first approx- 
imations of some useful mathematical functions, such as ^/x, l/a;,expa;. 
This block is known as a Look-Up- Table Building Block (LBB) in APE 
jargon. Results of the mathematical block are written back to the register 
file (for later reuse or store onto the memory). 

• Data can be moved to the Address Generation Unit ( AGU) , where a mem- 
ory address or a branch-address can be computed out of two register- 
operands and one immediate-operand (the displacement). New (data or 
branch) addresses are stored in appropriate registers to be used at the 
next memory reference or branch. 

• Logical tests can be evaluated on results computed from the mathemat- 
ical processor. The outcome of such tests goes onto a stack where more 
complex logical conditions can be evaluated. The top of the stack is used 
to control program flow by acting on the program-counter circuitry (cor- 
responding to if (...) then in high level programs) or to block write 
operations onto memory or register file (where (...) statements in APE- 
likc high level programs). 

The processor is controlled by a relatively large program word (called the Micro- 
code Word) directly controlling the various devices in the node. (Almost) no 
instruction decoding is performed on chip. This scheme has been successfully 
used in the node processors of both APEIOO and APEmille. A word size of 128 
bits is large enough to control the system. 

In the following, we describe in more details several key units of the proces- 
sor. 



8.1 The memory Interface and the Network Interface 

In this section we describe the memory and network interface, sketched as mem- 
ory grey box in figure A basic structure of this subsystem is shown in fig ^. 
The diagram shows several paths: 

• there is a direct data path from the Register File to the physical memory 
interface (and vice versa), supporting normal memory access. 

• Data from memory can be also fed to the Network Interface (and eventu- 
ally routed to a remote node) . Conversely, data arriving from the Network 
(from a remote node) can be routed to the Register File. 
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Figure 4: Top level block diagram of the memory and network interface. 



• Data words may be sent to the network from the register file. This is a 
novel feature, allowing register-to-register remote communications. This 
feature reduces remote bandwidth requests in some cases (notably in the 
evaluation of the Dirac operator). 

• The network interface receives data from the memory (or the registers) 
and route it to the appropriate destination through one of the six links 
(Details on the network itself will be provided later on) . 



In figure || we show a more detailed view of the interface to the physical 
memory. We see that memory is divided into cache memory and external mem- 
ory: 



• external memory. External memory implements the large memory bank 
of the node. As discussed in the section on technologies, we may use (for 
instance) DDR 1 Gbit memory chips. We have several options of memory 
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Interface to memory chips 




Figure 5: A lower level view of the memory system. 

bus width and bank size satisfying bandwidth constraints (see table ^ . We 
want to leave these options open at this point in time. For this reason, 
figure H still has a grey box. This grey box contains the actual state 
machine controlling memory access, memory correction circuitry, refresh 
control circuitry and any other ancillary logic. The box will be designed in 
detail at a later stage of the project, after the actual memory technology 
has been selected. For the moment, we model the block by a simple 
interface in which data words coming from the memory are validated by 
an ad-hoc signal. 

• cache memory A limited amount of on chip memory is needed in the 
node. Fast access on-chip memory will be used to store control variables 
(i.e. loop counters) and memory pointers. These variables were stored 
in the data memory block of the control processor (TIOOO) in APEmille. 
Indeed, these variables have very irregular access patterns and very short 
access bursts. The use of relatively long-latency dynamic memory would 
adversely impact performance. On-chip memory does not need to be very 
large, of the order of IK data words. Note that, in spite of the name, this 
is not a true hardware controlled cache system, since the decision to store 
variables on-board or otherwise is statically made at compile time (one 
very simple strategy would be to store on-board all non vector integer 
quantities defined by a program). 
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In any case, as seen by the processor, the memory interface has a word-width 
of 128 bits (one complex double precision number) and provides one new word 
at each clock cycle in burst mode. Addressing is done on 64-bit boundaries (so 
real and integer variables can be stored efhciently). 

8.2 The Instruction Cache 

Actual LQCD simulations typically spend an extremely large fraction of the 
running time in just a few critical loops. For instance, a fuU-fermion hybrid 
Monte Carlo code spends nearly 95% of the time in the kernel used to compute 
the value of the Dirac operator on the fermion fields. Under these conditions, 
an instruction-cache system should have very large efficiency. We may exploit 
this feature by storing node programs in the same memory bank as data, with 
obvious advantages in terms of pin-count, real-estate reduction and cost savings. 

We consider a control word (micro-code word) of 128 bits, equal to the word 
size that can be fetched from memory at each clock cycle. We may modify the 
memory interface as shown in fig ^ 

Consider for the moment just the instruction FIFO. The memory controller 
(not shown in the picture) continuously looks-ahead and prefetches instructions 
from the memory, at all machine cycles in which data-memory transactions 
are not in progress. Under the fully pessimistic assumption that all program 
cycles involve data-memory accesses, this mechanism reduces performance by 
a factor < 2. Now consider the instruction cache. The instruction cache is 
loaded when or before the critical kernels are executed the first time (possibly 
under program control: the program writer may advise the compiler through 
appropriate directives that some routine or do loop is a critical kernel). The 
program then completes all following loops fetching instructions from the cache 
without incurring in any time penalty. The expected efhciency e is 

'=/ + 2x\l-/)' 

where / is the fraction of cached program instructions. If we expect to cache 
90% of all used instructions (a rather pessimistic value) we may still expect 90% 
program efhciency. 

The size of the cache needed to accommodate the computational kernels is 
an important parameter. We have analyzed several LQCD kernels used in TAO 
and TAOmille physics programs and we have found that a cache size of the 
order of 16 k Words is large enough. 
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Figure 6: The instruction cache and the program look- ahead system. 
8.3 The Register File 

The register file has the same architecture as in APEmille. The register file 
has three read-only ports, one write-only port and one bi-directional port. The 
read-only ports are used to feed data to the mathematical processor, while the 
write-only port stores data from the mathematical processor onto the RF. The 
bi-directional port is used for memory access. All ports can be used at each 
clock cycle (five independent addresses are needed). 

The word size of the processor is 64 bits, and complex numbers are stored 
as pair of (adjacent) registers. 

The depth of the register file affects the performance of the node. If there 
are not enough registers available, temporary results cannot be hold on-register. 
Memory bandwidth requirements increase and program efhciency decreases. In 
APEmille, 512 registers (or 256 register pairs) were used. Table ?? lists the 
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number of registers used by critical LQCD kernels programmed in TAO and 
compiled for APEmille. As we see, less than 256 are needed in all programs. 
We plan to design a register file of the same size as APEmille (512 lines). If we 
find out that such a large system does not work at the required speed, we know 
that the size can be halved without serious problems. 

8.4 The Mathematical Processor 

The computing engine contained in J&T performs three tasks: 

1. it performs the floating-point (FP) (and, less frequently, integer) arith- 
metic operations heavily used in any scientific code. This is of course the 
most important functionality of the computing engine. All design trade- 
off's must be guided by the aim of achieving highest possible sustained 
performance for the most relevant tasks. As already discussed, we will 
use the IEEE double precision format only. The system will be heavily 
optimized for the arithmetics of complex-valued numbers. 

2. it computes first approximations of several important special functions (as 
already remarked, these functionalities are called LUT operations in APE 
jargon). 

3. it performs all (mostly integer) arithmetic and logic operations needed 
to compute memory addresses. This task was carried out in a separate 
chip in APEmille, with dedicated hardware. We plan to share just one 
processor for this task and the previous one (and also for the fourth task, 
described below). As shown elsewhere, the price paid by this optimization 
in terms of performance is small. 

4. it performs all arithmetic and logic operations supporting the evaluation 
of branch conditions. All considerations made before about addressing 
also apply here. 

We plan, for obvious reasons, to reuse to a large extent the logical design 
and implementation of the arithmetic block used in APEmille (called FILU for 
Floating - Integer - Logic Unit). This goal is most easily reached by extracting 
from FILU the double precision FP data path, the integer data path and the 
LUT circuitry, and building more complex operators as combinations of these 
building blocks. We recall that the FP data path performs the normal FP 
operation (d = a x b + c) and conversions between FP and integer numbers, 
while the integer data path performs standard arithmetic and logic operations 
in integer format. We call these basic data paths the FBB (Floating Building 
Block), the IBB (Integer building Block) and the LBB (LUT Building Block). 
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Experience with the development of APEmiUe has taught us that a minor 
effort is needed to finahze the design of the IBB and LBB. Here we consider in 
details only the FBB. The architecture that we consider is shown in figure 0. It 
uses data stored in the RF, that contains 256 register pairs. The two elements of 
the pair share the same address on each of the three ports. A complex operand 
has its real and imaginary parts stored on the same word of both registers, 
while a real operand sits on any location of either block. A vector operand 
finally is made up of two independent real values, stored in the same way as a 
complex operand. Vector operations can be efficiently used in LQCD codes for 
the generation of random numbers. 
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Figure 7: Block diagram of the floating point data-path (FBB) within the math- 
ematical processor. 

The FBB (sec figure |) uses four basic floating point blocks, wired in such a 
way as to: 

• compute the complex-valued version of the normal operation: 

d.re = a.re x b.re — a.im x b.im + ere (4) 
d.im = a.re x b.im + a.im x b.re + c.im (5) 



• compute one real- valued normal operation on operands coming from any 
element of any register. 
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• compute two real-valued normal operations on ordered operand pairs sit- 
ting on the right and left register banks respectively (vector mode). 

Some basic figures of this architecture are collected in table |l3|. Control of 
the processor requires 5 opcode bits in the microcode word and 4 addresses for 
the RF ports. In total 8 x 4 + 5 = 37 control bits are needed. 



type 


performance 


operands in RF 


complex 


1600 MFlops 


256 


real 


400 MFlops 


512 


real vect. 


800 MFlops 


256 


integer 


200 Mips 


512 


int. vect. 


400 Mips 


256 



Table 13: Basic parameters of the mathematical processor. 



8.5 Performance Estimates 

We have worked out some preliminary (but for the considered processor models 
rather accurate) forecasts of the expected efficiency of the processor outlined 
in the previous sub-sections on a few computationally intensive kernels. Our 
results are shown in table ^ for two versions of the kernel of the Dirac operator 
and for the main kernel of the Lattice Boltzmann Equation (LBE) solver. 



Kernel 


normals 


APEmille 


apeNEXT 


Dirac 1 


336 


44% 


45% 


Dirac2 


336 


58% 


72% 


LBE 


870 


55% 


60% 



Table 14: Measured performance on APEmille and estimated performance of 
J&T on some critical kernels described in the text. 

The first two codes are appropriate for LQCD programs, while the last kernel 
has been used for the simulation of turbulent fluid flows on APE systems. The 
two Dirac kernels refer to a simple program written in TAO (Diracl) and to 
an highly optimized code written in assembly (Dirac2). We can safely conclude 
that the performance will be in most cases comparable or superior to APEmille. 

We are still working to make our prediction more accurate and to test the 
efficiency of the processor on a larger set of computational kernels. 
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8.6 Implementation Issues 

In table [ij we estimate the gate count of the largest logical blocks used in J&T. 
In the table we foresee a 0.18/i CMOS technology, as discussed in a previous 
section. Most values are evaluated by using appropriately scaled corresponding 
figures for APEmille and allowing large safety margins. 



what 


APEmille 


scale Factor 


apeNEXT (gates) 


apeNEXT(TOm^) 


Reg File 


200 K 


2 


400K 


5.0 


Math. Unit 


100 K 


2.5 


250K 


3.7 


Intf. 


30 K 


2 


60K 


1.0 


Data cache 





NA 


\K X 1286 


4.4 


Instr. cache 





NA 


IQK X 1286 


34 


Total 


330 K 


4.5 + cache 


TOOK + caches 


48 



Table 15: Gate count and area estimate for the main components of the ape- 
NEXT custom processor 

Power dissipation for this system is less than 2.5 W at 200 MHz (assuming 
that about 30% of the gates switch at each clock cycle) . This processor fits into 
a reasonably small die and has a relatively small pin count. 



9 The Interconnection Network 

The interconnection network is specifically tailored to the needs of LQCD sim- 
ulations. The networks supports rigid data transfers between: 

• nearest neighbour nodes in the positive and negative direction of the three 
axis (single hops) 

• next to nearest nodes, whose node-coordinate differ by ±1 in two of the 
three dimensions (double hops) 

More formally, the network performs rigid shifts of the mesh of processors onto 
itself: 

{x, y, z) [x Ax, y + Ay, z 4- Az) (6) 

where (x, y, z) labels the coordinates of each processor. The shifts (Ax, Ay, Az), 
with jAij = 0, 1 and |Az| < 2, are the same for each processor. 
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Each link has a target bandwidth of at least 300 MBytes/sec per link. Each 
node needs six links to support all the above described communication patterns. 

Prom the point of view of system architecture the network is logically syn- 
chronous and supports SIMD program flows, although at the layer of the phys- 
ical link, no (wall-clock) time synchronization is needed. 

This definition can be made more precise in the following way: 

• Consider a SIMD program started on all nodes of the machine. Each node, 
while executing the program, starts a well defined sequence of remote 
communications. The sequence is the same for all nodes. 

• we tag all remote communications by the following set of attributes: 

{Ax,Ay,Az,S,N) (7) 

where the A's were defined before, S is the size of the data packet associ- 
ated to the communication and A'^ is an identifier that labels all communi- 
cations issued by each program (in the following, we call N the message- 
tag). N is initialized at 0, when starting the program and is incremented 
every time a new communication is started. In other words, N defines 
an ordering of all communications inside the program. Note that all at- 
tributes of each remote communication are equal on all nodes. 

• The network interface of each node accepts data bound to a remote node 

and tries to send it to destination. Note that although all nodes necessarily 
send the same sequence of packets, the (wall-clock) time at which a new 
data transfer starts may differ slightly among nodes. The following simple 
protocol controls the ensuing traffic: 

1. Each network interface refuses to accept a data packet coming from a 
different node and tagged by N unless it has been already instructed 
by its own node to start transferring N. 

2. Each network interface delivers incoming data in strict ascending A'' 
order. 

This protocol is needed to make sure that all messages reach destination in 
the appropriate ordering. As we see, very simple rules are needed to reach 
this goal under the assumption that programs follow the SIMD paradigm. 

Note that the network can perform several useful sanity checks: 

• Once a node instructs the network to send a data-packet to a given des- 
tination, the network implicitly knows which packets it should expect on 



41 



NETWORK MODULE 



STX ETX 



COMPUTATION NODE 



NETWORK MODULE 









LNK 






LNK 







COMPUTATION NODE 



NETWORK MODULE 



LNK 



LNK■"^ 



COMPUTATION NODE 



Figure 8: Architectural block diagram of the interconnection network. 



its links with a given tag. For instance, if a data-packet must be routed 
to south - east, then data with the same tag is expected from west for 
delivery at the local node and from north to be routed to east. The 
network interface can check that this is actually the case. 

• The network can also check whether the right sequence of tags is received 
within a (programmable) time-out delay. 

• The network can further check whether the data-sizes of all messages as- 
sociated to a given tag are equal. 

All these checks are important to help debug either ill-functioning hardware or 
wrong programs. 
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Error rates in the network are an important issue. In plain fact, we do not 
know the Bit Error Rate (BERR) that we may expected on fast hnks. We are 
only able to quote the upper limit on the BERR implied by the safe operation 
of the APEmille prototype (10~^^). The BERR value needed for an error free 
apeNEXT operation on runs lasting a few days is an embarrassingly low 10~^^. 
For this reason, we have decided to stay on the safe side and to design a network 
partially able to recover from errors. If we are able to recover from errors, a 
much more manageable picture emerges: for instance a more modest BERR 
~ 10""'^^ implies the failure of one transmission burst every second on the entire 
machine. 
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Figure 9: Low-latency retry mechanism for the apeNEXT links. 



We are considering a retry mechanism, shown in figure |S| that has only a 
moderate impact on communication latency: 



• We divide each data block traveling on a remote link in relatively short 
bursts (say, 16 bytes) followed by a cyclic- redundancy-check (CRC). 

• Data bursts are sent from the transmitting nodes, followed by their CRC. 
A small number of bursts is kept on the transmitting node, stored inside 
a FIFO queue, also after transmission. 

• The receiving end of the link checks the CRC of each burst as it arrives. 
If the check is successful it delivers received data. The latency implied by 
this procedure is not longer than the size of each burst. 

• If an error is detected, the receiving end requests the corrupted burst to 
be retransmitted. This is possible, since relevant data is still available on 
the transmitting side of the link. 
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We plan to finalize most of the high level details of the network using a 
black box model of the physical link layer. In this way we can complete most of 
the design even before selecting the actual link technology. In parallel, we are 
already carrying out tests on some of the link technologies. 



10 Machine Assembly and Partitioning 

We plan to assemble a certain number of APEmille processors on a printed 
circuit board (PCB). Preliminary evaluations suggest that 16 processors can be 
placed on one PCB, of roughly the same size as the one used for APEmille. 
For comparison, note that one APEmille PCB houses 8 processors. In this case 
however a large (almost 50%) fraction of the real estate is used by the control 
processor and ancillary circuitry. If we use PCB's of the same size as APEmille, 
we can immediately reuse the mechanical components of the older system. 

At this point in time we have two options for the topology of the nodes 
belonging to one PCB. The first option is a three-dimensional structure with 
2x2x4 processors. The second option implies a two-dimensional set-up of 4 x 4 
processors. PCB's are assembled inside a crate. All PCB's inside one crate are 
connected to a communication backplane. If we use the mechanical components 
developed for APEmille, we can reasonably house up to 16 PCB's inside one 
crate. Larger machines use more crates. 

If we use the first option for the node topology inside a PCB, we can allocate 
onto the backplane all links in two of the three dimensions (say, directions y and 
z), building a system of size 4x8x8. Communications in the x direction are 
implemented via cable links. If we assemble and connect together n crates, we 
obtain apeNEXT systems of size (4 x n) x 8 x 8). This option is very similar 
to the one used by APEmille, where systems of size (2 x n) x 8 x 8 are allowed. 
We call this arrangement Option lA. 

It is possible to use the same structure of the PCB as above, wiring however 
the unit inside a crate according to a 4 x 4 x 16 topology. In this case large 
machines contain (4 x n) x (4 x to) x 16 nodes. This is option IB. 

In the case that the second option for the PCB is selected, we allocate 
all links belonging to one of the spatial directions (say, direction z) onto the 
backplane. Links in the x and y directions stemming out of the PCB use cable 
links instead. Using this arrangement, systems of size (4 x n) x (4 x to) x 16 
can be assembled. This is option 2. Some basic figures relevant to both options 
are collected in table |l^. 

In both large number of signal must be routed on the backplane. 
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Option lA 


Option IB 


Option 2 


number of procs. 


16 


16 


16 


Peak PCB perf. 


25.6 GFlops 


25.6 GFlops 


25.6 GFlops 


topology 


2x2x4 


2x2x4 


1x4x4 


crate topology 


2x8x8 


4 X 4 X 16 


4 X 4 X 16 


Large-systems 


(2 X n) X 8 X 8 


(4 X n) X (4 X m) X 16 


(4 X n) X (4 X to) X 16 


Remote links (PCB) 


40 


40 


48 


Remote links (BP) 


32 


24 


32 


Remote links (cables) 


8 


16 


16 



Table 16: Basic figures of three possible apeNEXT machine configurations. 

This is a serious but not formidable engineering problem. Assuming that 20 
data lines are needed per link, we have 640 pins carrying data from the PCB 
to the backplane (see again table ^6|). This requires as little as about 17 cm 
on the PCB edge, using high-density high-speed matched-impedance connectors 
developed by several vendors (see for instance [^). Of course, special care must 
be taken in the design of accurately matched transmission lines, both on the 
backplane and on the main PCB. 

As discussed earlier, the backplane must also house a CPCI backplane. This 
is made by a straightforward copy the well tested CPCI backplane developed 
for APEmille. 

We will decide later on in the design phase on the selected topology, using 
information from test setups and taking also into consideration the relative 
merits of the two solutions from the point of view of physics simulations. 



11 Software: The Programming Environment 

The apeNEXT programming environment will be initially based on two main 
lines: 

• The TAO programming language, extensively used in APEIOO and in 
APEmille, will be supported. This is necessary to allow easy and early 
migration of the large set of existing QCD programs on the new machine 
This large portfolio of programs is also going to be extremely useful for 
test and debugging purposes. 

We do not plan to make any substantial improvement to TAO. We will 
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just modify the back-end section of the TAO compiler, so it produces 
apeNEXT assembly codes. 

• We plan to develop a C/C++ language compiler for apeNEXT since 
the early phases of the project. Very few extensions will be added to 
the standard C syntax, with the goal of minimizing the effort for the 
programmer in learning a new language. SPMD parallelism will be realized 
by just a few special constructs, similar to the ones already present in TAO: 

1. the where statement executes code blocks based on local conditions. 

2. the all, none, ciny keywords in a standard C condition perform 
aggregate evaluation of local conditions. 

3. Remote communications will be specified by constant pointers. 

The compiler will be implemented by porting already available public do- 
main compilers (like, for instance, the GNU C/C-l — h compiler or Icc) with 
the needed SPMD extensions in the front-end (the language definition) 
and all necessary changes in the back-end, to produce the target assem- 
bly. A similar GNU-gcc based compiler prototype for APEmille is 
currently evaluated. 

Note that, by using already available compilers, it will be relatively easy 
to implement all SPMD extensions of the language on more traditional 
machines (like PC clusters). Conversely, already developed parallel pro- 
grams written in C (and following the SIMD/SPMD paradigm) will be 
easily ported onto apeNEXT. We regard this possibility as our main 
path to build a common programming environment between apeNEXT 
and more traditional systems. 

We want to push still further the goal of a more general apeNEXT pro- 
gramming environment. We plan to merge to some extent the programming 
environments based on Tao and on C/C++, and at the same time enhance the 
portability of programs between APE systems and more traditional computer 
clusters. We plan to work according to the lines described in figure that 
uses for definiteness the structure of the GNU compiler. 

The blue boxes in figure ^ sketchily describe the overall organization of a 
modern compiler. There is a front-end block with a configurable parser that 
transforms the user code into an internal representation, based on a tree repre- 
sentation of the code and symbol tables. The back-end block maps the internal 
representation onto assembly code for a specific target architecture. 

The APE C compiler can be implemented on the basis of existing and con- 
figurable front-ends with minor modifications to include the required syntax 
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Figure 10: A sketchy view of the internal structure of the GNU compiler, in- 
cluding planned extensions for the apeNEXT software environment. 



extensions for parallel processing. The back-end section must of course be cus- 
tomized to produce apeNEXT assembly code. 

It is also possible to add an additional parser at the front-end level. (For 
instance, Fortran is implemented in this way in the GNU compilers.) We intend 
to follow this path and to include the TAO parser, suitably modified to generate 
the GNU internal representation. Indeed, TAO cannot be easily handled by 
standard configurable parsers because of its dynamic grammar. In figure |lO| the 
APE specific extensions are drawn in red. 

When the program outlined above is accomplished, we will have a very neat 
portable environment in which: 

• all powerful optimization techniques of a standard compiler core are avail- 
able. 

• TAO and C codes can be compiled for a standard computer system (e.g. 
a PC). 

• C and TAO codes can be compiled for an APE system. 



47 



The design and implementation of this open programming environment is a 
long term and very high priority goal of our project. This is not going to be an 
easy task and can certainly not be finalized in short terms. Physics exploitation 
of apeNEXT in the early phases does not depend on this environment, since 
the traditional APE software tools can be used. 

At the machine level, we will port to the new architecture and improve the 
well-established VLIW code-scheduling and code-compressing tools already used 
in APEmille. 



12 Software: The Operating System 

We plan to shape the apeNEXT operating system as a direct evolution of the 
APEmille system: the basic idea is that we use as much as possible the services 
provided by Linux on the network of host PC's. 

• The apeNEXT operating system must load executable codes on the array 
of processing nodes and start execution of the whole system. We remind 
that all steps of program compilation and optimization are not done on 
the apeNEXT processors itself but are performed on the host PC's (or 
on any other Linux machine). 

• The second task performed by the operating system is the support for 
input-output operations rcqiicstcd by the executing program. Note here 
that these operations use the standard file systems available on the host 
PC's (or, on any networked disk server). Of course, large data transfers, 
where high bandwidth is needed, are performed in parallel by all PC's on 
local disks. Later on, we will make some additional remarks on this point. 

• The third task performed by the operating system is the monitoring and 
control of all nodes at a low level. Typical examples include the inspection 
and setting of status-registers, the analysis of error conditions, explicit 
writes or reads to memory locations. 

All functions described above are handled by the APEmille operating system 
in a reasonably efficient and user-friendly way. Most operations can be easily 
moved onto apeNEXT by rewriting only the lowest layer levels of the operating 
system, like device drivers or the functions mapping a specific operation on a 
specific node onto the appropriate PC. We expect therefore to be able to put to 
work quickly an early version of the system. 

An area where we would like to come up with new ideas, not needed how- 
ever for the early commissioning of apeNEXT, is some version of a parallel 
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file system, where large field and propagator configurations can be stored in a 
standard format. This is an obvious starting point to allow the sharing of QCD 
configurations among collaborating groups. We see this work as a partial contri- 
bution of apeNEXT to the GRID project. Work on this line will be therefore 
coordinated with GRID. 



13 Design Method 

In designing the needed VLSI components as well as the overall system, we 
want to follow the method used in APEIOO and APEIOOO, with a number of 
improvements to make it more efficient and faster. The main advantage of this 
method has been shown in APEIOO and APEmille: in both cases all components 
of the machine were designed "first- time-right". 

The main idea behind our method is some informal implementation of "hardware- 
software co-design" techniques: 

• We base our design on a VHDL model of a large and significant fraction 
of the whole system. The model contains all in-house developed systems 
as well as all off-the-shelf components. Initially, the model will be a very 
crude approximation to the actual system, gradually incorporating all de- 
tails. This reference model is available at all collaboration sites. 

• All VLSI (or FPGA based) components of the system are derived with 
high quality synthesis tools from the VHDL design. In the (hopefully 
rare) cases where some components cannot be synthesized from a VHDL 
description, a VHDL model is built anyway, and test vectors for the ac- 
tual implementation are derived by the VHDL model. Non-VLSI parts of 
the systems (i.e., processing boards) will be modelled in VHDL by their 
designers. 

• The VHDL model supports a reasonable approximation of the interaction 
with the host system (operating system). 

• All software developments arc immediately tested on the VHDL model. 
At an early stage of the design, tests will involve performance estimates 
on crudely modelled architectural choices. As the level of details of the 
model increases, actual programs, in all their intricacies, will be executed 
on the model, giving quick feed-back on any design detail. 

As an improvement with respect to APEmille, we will insist on: 
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• a continuous availability of the model at all collaboration sites, so all 
members of the collaboration can easily monitor the effects of a design 

change made elsewhere. This can be achieved with reasonable effort by 
keeping a master copy of the model on an AFS cell available from all sites. 

• an effort to allow access to the model from a basic version of the op- 
erating system, so that even the more physics-oriented members of the 
collaboration can exercise it. 

• a systematic use of blind-test procedures: at all stages of the design phase, 
test sequences for any part of the machine will be prepared and executed 
by someone who has not been directly involved in the design. 

Finally, we address the issue of the design of some VLSI blocks that depend 
critically on some element which is either not under our direct control or not 
completely defined at this stage. Examples of this situation are the details of 
the memory system (which heavily depends on the type of memory available at 
the time when the prototypes are built) for the custom processor, or the actual 
choice for the physical layer of the interconnection links. Changes made in 
memory technology during the design lifetime of APEmille have indeed adversely 
affected that project, as large subsystems within JIOOO and TIOOO processors 
had to be modified to adapt to changing memory specifications. 

We want to solve this problem by confining all details of the memory and 
link interfaces in a " grey box" , that interacts with the rest of the design with 
some simple and rather general data and control paths. In this way, a very large 
fraction of the design can be finalized independently, while the interface-specific 
blocks will be procured (if available from external sources) or designed in detail 
at the last moment. 



14 Conclusions 

This document has described physics requirements and the basic architecture 
of a next generation LQCD computer project. We think that the well-tested 
SIMD/SPMD architecture of the previous APE generations is still the best 
choice for a LQCD-focused high performance engine. At the engineering level, 
we plan to use technologies similar to those used in APEmille. We think that 
this choice reduces development costs and risks. 

In the near future, collaboration between groups active in LQCD simula- 
tions will become tighter and tighter. For this reason we plan to work hard 
on the development of a software environment allowing easy migration between 
apeNEXT and more traditional computers. 
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